# -.-|m { input: true, output_error: false, input_fold: hide }
# load pretty jupyter
%load_ext pretty_jupyter

Introduction

This simulation is split into four parts: 1) Data Collection, 2) Data Processing, 3) Statistical Analysis, and 4) Simulating Games.

The data collection step consists of reading in information from play-by-play data. This is necessary because the simulation attempts to simulate a game by estimating the result of each possession between two teams. In order to get accurate data for each team's possession results, play-by-play data is parsed.

The data processing step takes the collected data and prepares it for the statistical analysis step. The raw data is interesting to look at, but not directly usable for modeling. It consists of season-level data, which shows how teams compare over the course of the whole season, and game-level data, which shows how individual teams compared in a head-to-head matchup. The data processing step mainly uses the game-level data. It picks an arbitrary date around halfway into the season and compiles the sum of the data before that point. For every game after that date, it updates the running sum from the previous games and also records the stats from the current game. The data is then reshaped so that, instead of listing how many times each possession result occurred in the game, the dataset contains one row per possession result. This makes the result categorical, so that it can be more easily analyzed in the next step.

The statistical analysis step is relatively simple. A multinomial logistic regression is performed on the data. The regression takes the offense's and defense's previous possession-result percentages as inputs and estimates the probability of every possession result. This is done with scikit-learn's LogisticRegression class.

After the statistical analysis, there is a model that can take two teams' possession probabilities and produce expected results for their possessions. These probabilities are fed into a simulation that plays out a game possession by possession. Many games are simulated, and their individual results are listed along with the average score.

Data collection

For the data collection step, a number of functions are used to handle the different possibilities of a possession. The possibilities that pertain to the possession count consist of shots, rebounds, turnovers, and fouls. Free throws are also looked at to help simulate scores more accurately.

pandas is used for most of the work.

import pandas as pd

# get rid of the max display columns so it is always possible to see all team statistics
pd.set_option('display.max_columns', None)

Handle Game

The handle game function is used to find all of the useful statistics necessary for simulating a game. It does this by reading through the play-by-play data for an individual game, and determining what kind of play each row is describing. If the kind of play is relevant to the simulation statistics, helper functions are used to break down the contents of that play.

def handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):    
    
    # array to hold the dictionaries containing the team stats
    stats = [home_team_poss_res, away_team_poss_res]
    

    for play in group_data.itertuples():

        play_type = play.type_text

        if "Shot" in play_type and play_type != "Block Shot":
            stats = handle_shot(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)

        elif "Rebound" in play_type:
            stats = handle_rebound(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)

        elif "Turnover" in play_type:
            stats = handle_turnover(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)

        elif "Foul" in play_type:
            stats = handle_foul(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)

        elif "FreeThrow" in play_type:
            stats = handle_free_throw(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
    
    return [stats[0], stats[1]]
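
Because the checks in handle_game are substring tests, the order of the branches decides which handler a play reaches. A minimal sketch of the same dispatch is shown here, returning a label instead of calling a handler. The function name `classify_play` and the sample `type_text` strings are hypothetical and may not match the values in the actual play-by-play file.

```python
def classify_play(play_type):
    """Return which handler handle_game would pick for a play type string."""
    if "Shot" in play_type and play_type != "Block Shot":
        return "shot"
    elif "Rebound" in play_type:
        return "rebound"
    elif "Turnover" in play_type:
        return "turnover"
    elif "Foul" in play_type:
        return "foul"
    elif "FreeThrow" in play_type:
        return "free_throw"
    return None  # play types that do not affect the tracked statistics


# Hypothetical type_text values, used only for illustration
for text in ["JumpShot", "Block Shot", "Defensive Rebound", "MadeFreeThrow", "Timeout"]:
    print(text, "->", classify_play(text))
```

Plays that match none of the branches, including "Block Shot", are skipped without changing any statistics.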

Handle shot

The handle shot function is used to determine what happens after a shot in the game. It works as follows:

Determine whether the shot was made. If it was, determine whether it was a two-point shot or a three-point shot, and adjust the team statistics accordingly (record what kind of shot was made and increment the team's possession count).

If the shot was missed, increment the team's missed field goal count.

def handle_shot(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.scoring_play:
        if play.score_value == 3:
            if play.team_id == home_team_id:
                home_team_poss_res['thr_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['thr_fgm'] += 1
                away_team_poss_res['poss'] += 1
    
        elif play.score_value == 2:
            if play.team_id == home_team_id:
                home_team_poss_res['two_fgm'] += 1
                home_team_poss_res['poss'] += 1
            elif play.team_id == away_team_id:
                away_team_poss_res['two_fgm'] += 1
                away_team_poss_res['poss'] += 1
                
    # for plays that are not scoring plays                
    else:
        if play.team_id == home_team_id:
            home_team_poss_res['fg_miss'] += 1
        if play.team_id == away_team_id:
            away_team_poss_res['fg_miss'] += 1
        
                    
    return [home_team_poss_res, away_team_poss_res]

Handle Rebound

The handle rebound function is used to determine the results of a rebound.

The first thing that is checked is whether the rebound was an offensive rebound or a defensive rebound. If it was an offensive rebound, the team that got the rebound is determined, and that team's offensive rebound count and possession count are incremented. While this is not the standard method for counting possessions, the possession count is incremented here so that the probabilities for each possession result are reflected more accurately: an offensive rebound is itself treated as one of the possible possession results in the later steps.

The procedure is slightly different for defensive rebounds. The function determines which team got the rebound, and the other team's possession count is incremented. This is because this project counts possessions at the end of the possession (rather than the start). A defensive rebound is due to the other team missing a shot, and ending their possession. Thus, defensive rebounds result in the opposing team's possession count being incremented.

def handle_rebound(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):  

    if "Defensive" in play.text:
        if play.team_id == home_team_id:
            away_team_poss_res['poss'] += 1
        elif play.team_id == away_team_id:
            home_team_poss_res['poss'] += 1
            
    elif "Offensive" in play.text:
        if play.team_id == home_team_id:
            home_team_poss_res['oreb'] += 1
            home_team_poss_res['poss'] += 1
                
        elif play.team_id == away_team_id:
            away_team_poss_res['oreb'] += 1
            away_team_poss_res['poss'] += 1

    return [home_team_poss_res, away_team_poss_res]
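
The end-of-possession counting rules described in this section can be sketched on a short sequence of events. This is a simplified illustration rather than the project's code: the function name `count_possessions` and the event labels are hypothetical, and only the possession count is tracked.

```python
def count_possessions(events):
    """Tally possessions for 'home' and 'away' using end-of-possession counting."""
    poss = {"home": 0, "away": 0}
    other = {"home": "away", "away": "home"}
    for team, kind in events:
        if kind in ("made_shot", "turnover", "offensive_rebound"):
            poss[team] += 1          # credited to the team that made the play
        elif kind in ("defensive_rebound", "foul"):
            poss[other[team]] += 1   # credited to the opposing team
        # a missed shot on its own does not change the possession count
    return poss


events = [
    ("home", "missed_shot"),
    ("home", "offensive_rebound"),  # home keeps the ball, but it is still counted
    ("home", "made_shot"),
    ("away", "missed_shot"),
    ("home", "defensive_rebound"),  # ends away's possession
    ("home", "turnover"),
    ("home", "foul"),               # home commits the foul, away is credited
]
print(count_possessions(events))  # {'home': 3, 'away': 2}
```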

Handle turnover

The handle turnover function is straightforward. It determines which team committed the turnover. Because a turnover ends that team's possession, the team's turnover count and possession count are both incremented.

def handle_turnover(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.team_id == home_team_id:
        home_team_poss_res['tov'] += 1
        home_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        away_team_poss_res['tov'] += 1
        away_team_poss_res['poss'] += 1
    
    return [home_team_poss_res, away_team_poss_res]        

Handle foul

The handle foul function is important for tracking possessions that did not end in a shot. It checks which team committed the foul and updates the stats of the opposing team (the team that was fouled). The team that was fouled has its possession count incremented, as well as the count of how many times it was fouled.

def handle_foul(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
    
    if play.team_id == home_team_id:
        away_team_poss_res['got_fouled'] += 1
        away_team_poss_res['poss'] += 1
    elif play.team_id == away_team_id:
        home_team_poss_res['got_fouled'] += 1
        home_team_poss_res['poss'] += 1 
     
    return [home_team_poss_res, away_team_poss_res]    

Handle free throw

The handle free throw function does nothing to affect possession counts; it is simply used to keep track of a team's free throw percentage for the sake of the simulation. The team is determined, and either its free throws made count or its free throws missed count is incremented accordingly.

def handle_free_throw(play, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res):
        
    if play.scoring_play:
        if play.team_id == home_team_id: 
            home_team_poss_res['ft'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft'] += 1
    else:
        if play.team_id == home_team_id: 
            home_team_poss_res['ft_miss'] += 1
        elif play.team_id == away_team_id:
            away_team_poss_res['ft_miss'] += 1
        
    return [home_team_poss_res, away_team_poss_res]

Update stat_list

The update stat list function is used to keep track of teams' stats across many games. The stat_list structure is a dictionary of dictionaries. The outer dictionary has team names as keys and their statistics as values. The inner dictionaries have statistic names as keys and the actual counts as values.

This function works by first determining whether or not a team is already in the dictionary. If it is not, it enters their stats from their first game. Otherwise, it uses more current game data to add to the team's total stat count across the whole season.

def update_stat_list(team_stat_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    
    # update stats for home team
    if home_team_name in team_stat_list:
        for key, value in home_team_poss_res.items():
            if not isinstance(value, str):
                team_stat_list[home_team_name][key] += value
    else:
        team_stat_list[home_team_name] = home_team_poss_res
        
    # update stats for away team
    if away_team_name in team_stat_list:
        for key, value in away_team_poss_res.items():
            if not isinstance(value, str):
                team_stat_list[away_team_name][key] += value
    else:
        team_stat_list[away_team_name] = away_team_poss_res
    
    return team_stat_list
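
The dictionary-of-dictionaries accumulation can be sketched for a single team as follows. The helper name `add_game` and the stat values are hypothetical, and only two stat keys are kept for brevity. Unlike the function above, this sketch stores a copy of the first game's dictionary so that later updates cannot modify the per-game record.

```python
def add_game(totals, team_name, game_stats):
    """Add one game's numeric stats to a team's running totals."""
    if team_name not in totals:
        # store a copy so later updates do not modify the per-game dictionary
        totals[team_name] = dict(game_stats)
    else:
        for key, value in game_stats.items():
            if isinstance(value, (int, float)):
                totals[team_name][key] += value
    return totals


totals = {}
add_game(totals, "Idaho", {"team_name": "Idaho", "two_fgm": 19, "poss": 74})
add_game(totals, "Idaho", {"team_name": "Idaho", "two_fgm": 16, "poss": 87})
print(totals["Idaho"])  # {'team_name': 'Idaho', 'two_fgm': 35, 'poss': 161}
```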

Update opp list

The update opp list function effectively tracks a team's defense. The first thing that is done is a renaming. By taking one team's name and adding "_defense" to it, and then attaching that to the other team's stats, the result is the first team's defensive statistics.

Besides the switching of the team names, the rest of this function operates the same way as the update stat list function: it checks whether a team exists in the list and, if so, updates their stats accordingly. If the team does not already exist in the list, their stat values are initialized.

def update_opp_list(team_opp_list, home_team_name, away_team_name, home_team_poss_res, away_team_poss_res):
    # Make copies of the input dictionaries
    home_stats = home_team_poss_res.copy()
    away_stats = away_team_poss_res.copy()
    
    # Modify the copies
    away_stats['team_name'] = home_team_name + "_defense"
    home_stats['team_name'] = away_team_name + "_defense"
    
    # Update stats for home team
    if home_team_name in team_opp_list:
        for key, value in away_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[home_team_name][key] += value
    else:
        team_opp_list[home_team_name] = away_stats
        
    # Update stats for away team
    if away_team_name in team_opp_list:
        for key, value in home_stats.items():
            if isinstance(value, (int, float)):
                team_opp_list[away_team_name][key] += value
    else:
        team_opp_list[away_team_name] = home_stats
    
    return team_opp_list

Core data collection loop

The loop to collect all of the data is rather simple because so much of the work is done by helper functions. The play-by-play data is read into a pandas dataframe from a csv file. As the file consists of many games listed one after another, the pandas groupby() function is used to look at each individual game by game_id. When looking at an individual game, the home team and away team are determined. The teams, as well as dictionaries ready to hold their stats, are passed to the handle_game function. The handle_game function returns those updated dictionaries, and they are passed to the update_stat_list and update_opp_list functions. After reading through the whole game, the game date and each team's possession statistics are recorded in the result_list array. After doing this for the whole dataset, the team_stat_list and team_opp_list dictionaries contain every team's season statistics.

# read in the play by play data
games = pd.read_csv("2024_play_by_play.csv")

# group the play by play data to be able to look at individual games
grouped = games.groupby('game_id')

# two dictionaries used to keep track of offensive and defensive possession statistics
team_stat_list = {}
team_opp_list = {}

# used to keep track of when games occurred and the team statistics from that game
result_list = []

for game_id, group_data in grouped:
    
    home_team_id = group_data.iloc[0]["home_team_id"]
    away_team_id = group_data.iloc[0]["away_team_id"]

    home_team_name = group_data.iloc[0]["home_team_name"]
    away_team_name = group_data.iloc[0]["away_team_name"]
    
    game_date = group_data.iloc[0]["game_date"]
    
    # dictionaries to keep track of team possession statistics
    home_team_poss_res = {"team_name": home_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    away_team_poss_res = {"team_name": away_team_name, "two_fgm": 0, "thr_fgm": 0, "fg_miss": 0, "ft": 0, "ft_miss": 0, "tov": 0, "oreb": 0, "got_fouled": 0, "poss": 0}
    
    # handle game returns a list containing the possession statistics for each team
    game_stats = handle_game(group_data, home_team_id, away_team_id, home_team_poss_res, away_team_poss_res)
    
    # take note of the game date, and add a copy of each teams statistics from the game_stats array
    result_list.append([game_date, game_stats[0].copy(), game_stats[1].copy()])
    
    
    team_stat_list = update_stat_list(team_stat_list, home_team_name, away_team_name, game_stats[0], game_stats[1] )
    team_opp_list  = update_opp_list( team_opp_list,  home_team_name, away_team_name, game_stats[0], game_stats[1] )
    
# turn offensive and defensive stats into pandas DataFrames
team_off_stats = pd.DataFrame.from_dict(team_stat_list, orient='index')
team_def_stats = pd.DataFrame.from_dict(team_opp_list, orient='index')
team_off_stats.head()
                             team_name  two_fgm  thr_fgm  fg_miss   ft  ft_miss  tov  oreb  got_fouled  poss
Eastern Washington  Eastern Washington      620      274      889  525      156  422   327         628  2982
Montana                        Montana      725      278     1068  477      125  371   330         582  3135
Idaho                            Idaho      535      243      991  357      130  366   273         487  2705
Montana State            Montana State      608      300     1041  414      159  391   290         589  3068
Idaho State                Idaho State      623      218     1027  429      196  372   376         607  2970
team_def_stats.head()
                                     team_name  two_fgm  thr_fgm  fg_miss   ft  ft_miss  tov  oreb  got_fouled  poss
Eastern Washington  Eastern Washington_defense      549      282     1076  454      158  385   396         587  3029
Montana                        Montana_defense      697      225     1159  509      197  349   397         606  3210
Idaho                            Idaho_defense      546      236      945  464      172  356   316         571  2780
Montana State            Montana State_defense      678      222     1035  495      199  457   392         637  3201
Idaho State                Idaho State_defense      665      200      954  422      144  415   294         549  2882
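
The per-game iteration used in the collection loop can be sketched on a miniature table. The data below is hypothetical and contains only a few of the columns; it shows how groupby yields one sub-frame per game_id and why reading the first row is enough to identify the teams.

```python
import pandas as pd

# Hypothetical miniature play-by-play table with two games
plays = pd.DataFrame({
    "game_id": [1, 1, 1, 2, 2],
    "home_team_name": ["Idaho", "Idaho", "Idaho", "Montana", "Montana"],
    "away_team_name": ["Montana", "Montana", "Montana", "Idaho", "Idaho"],
    "type_text": ["JumpShot", "Defensive Rebound", "JumpShot", "JumpShot", "Turnover"],
})

plays_per_game = {}
for game_id, group_data in plays.groupby("game_id"):
    first = group_data.iloc[0]  # team columns repeat on every row of a game
    plays_per_game[game_id] = (
        first["home_team_name"],
        first["away_team_name"],
        len(group_data),        # number of plays recorded for this game
    )

print(plays_per_game)
```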

Data Processing

The goal of the data processing step is to take all of the data collected above and format it so that a multinomial logistic regression can be performed. This means taking aggregate possession data from previous games, turning it into percentages, and coming up with a categorical results column.

This code makes a dataframe to hold data (from each team's perspective) from every individual game. The data is read from the result_list array. The game date is noted, as well as each team's possession statistics from said game. One row is made in the dataframe from the perspective of the home team, and another row is made from the perspective of the away team. The end result is a dataframe with two entries for every game played.

# sort the list of results by game_date
result_list = sorted(result_list, key=lambda x: x[0])

stats_on_date = pd.DataFrame({})
for row in result_list:
    date = row[0]
    team1 = row[1]
    team2 = row[2]

    game = {"date": date, 
            
            "team_name": team1['team_name'], 
            "team_twos": team1['two_fgm'], 
            "team_threes": team1['thr_fgm'], 
            "team_miss": team1['fg_miss'],
            "team_ft": team1['ft'],
            "team_ft_miss": team1['ft_miss'],
            "team_tov": team1['tov'],
            "team_oreb": team1['oreb'],
            "team_fouled": team1['got_fouled'],
            "team_poss": team1['poss'],
            
            "opp_name": team2['team_name'], 
            "opp_twos": team2['two_fgm'], 
            "opp_threes": team2['thr_fgm'], 
            "opp_miss": team2['fg_miss'],
            "opp_ft": team2['ft'],
            "opp_ft_miss": team2['ft_miss'],
            "opp_tov": team2['tov'],
            "opp_oreb": team2['oreb'],
            "opp_fouled": team2['got_fouled'],
            "opp_poss": team2['poss']
           }
    
    opp_game = {"date": date, 
            
            "team_name": team2['team_name'], 
            "team_twos": team2['two_fgm'], 
            "team_threes": team2['thr_fgm'], 
            "team_miss": team2['fg_miss'],
            "team_ft": team2['ft'],
            "team_ft_miss": team2['ft_miss'],
            "team_tov": team2['tov'],
            "team_oreb": team2['oreb'],
            "team_fouled": team2['got_fouled'],
            "team_poss": team2['poss'],
            
            "opp_name": team1['team_name'], 
            "opp_twos": team1['two_fgm'], 
            "opp_threes": team1['thr_fgm'], 
            "opp_miss": team1['fg_miss'],
            "opp_ft": team1['ft'],
            "opp_ft_miss": team1['ft_miss'],
            "opp_tov": team1['tov'],
            "opp_oreb": team1['oreb'],
            "opp_fouled": team1['got_fouled'],
            "opp_poss": team1['poss']
           }
    
    # add values from home team's perspective
    new_row = pd.DataFrame.from_dict([game])
    stats_on_date = pd.concat([stats_on_date, new_row])
    
    # add values from away team's perspective
    new_row = pd.DataFrame.from_dict([opp_game])
    stats_on_date = pd.concat([stats_on_date, new_row])
    
stats_on_date = stats_on_date.reset_index(drop=True)
stats_on_date.head(10)
    
 
date team_name team_twos team_threes team_miss team_ft team_ft_miss team_tov team_oreb team_fouled team_poss opp_name opp_twos opp_threes opp_miss opp_ft opp_ft_miss opp_tov opp_oreb opp_fouled opp_poss
0 2023-11-06 Tulsa 16 8 36 14 6 18 20 18 100 Central Arkansas 14 6 40 7 5 12 11 15 89
1 2023-11-06 Central Arkansas 14 6 40 7 5 12 11 15 89 Tulsa 16 8 36 14 6 18 20 18 100
2 2023-11-06 Northwestern 20 5 34 17 4 12 16 19 94 Binghamton 15 7 31 10 2 19 10 16 90
3 2023-11-06 Binghamton 15 7 31 10 2 19 10 16 90 Northwestern 20 5 34 17 4 12 16 19 94
4 2023-11-06 Syracuse 23 5 39 22 5 11 17 22 105 New Hampshire 17 8 43 14 5 16 15 17 106
5 2023-11-06 New Hampshire 17 8 43 14 5 16 15 17 106 Syracuse 23 5 39 22 5 11 17 22 105
6 2023-11-06 Minnesota 19 5 22 27 8 17 13 29 100 Bethune-Cookman 19 4 47 10 4 14 23 16 104
7 2023-11-06 Bethune-Cookman 19 4 47 10 4 14 23 16 104 Minnesota 19 5 22 27 8 17 13 29 100
8 2023-11-06 Nebraska 16 11 29 19 10 9 8 21 92 Lindenwood 19 3 46 5 4 12 13 12 95
9 2023-11-06 Lindenwood 19 3 46 5 4 12 13 12 95 Nebraska 16 11 29 19 10 9 8 21 92
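
The two-rows-per-game layout can also be produced by collecting row dictionaries in a list and building the dataframe once at the end. This is a sketch of an alternative approach, not the cell above; the helper name `mirrored_rows` is hypothetical, and only the possession columns are kept for brevity.

```python
import pandas as pd

def mirrored_rows(date, team1, team2):
    """Return two row dictionaries for one game, one per team's perspective."""
    rows = []
    for team, opp in ((team1, team2), (team2, team1)):
        rows.append({
            "date": date,
            "team_name": team["team_name"], "team_poss": team["poss"],
            "opp_name": opp["team_name"], "opp_poss": opp["poss"],
        })
    return rows


# Possession counts taken from the first game in the output above; other stats omitted
tulsa = {"team_name": "Tulsa", "poss": 100}
central_arkansas = {"team_name": "Central Arkansas", "poss": 89}

all_rows = mirrored_rows("2023-11-06", tulsa, central_arkansas)
demo = pd.DataFrame(all_rows)  # build the dataframe once from the list of rows
print(demo)
```

Appending to a list and constructing the dataframe a single time avoids copying the growing dataframe on every pd.concat call, which can matter when there are thousands of games.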

The following code computes the aggregate possession data. February 1st is chosen as the point to start calculating game-by-game data. Every game before February 1 has its data summed up. Then every game on or after February 1 is iterated through: the team's aggregate data from all earlier games is computed, and the stats from the current game are kept as they are.

# Convert date column to datetime type
stats_on_date['date'] = pd.to_datetime(stats_on_date['date'])

start_date = '2024-02-01'

# Filter dataframe to include only games after or on the start date
filtered_df = stats_on_date[stats_on_date['date'] >= start_date]

# Initialize list to store rows for the new dataframe
new_rows = []

team_stats_columns = ["team_twos", "team_threes", "team_miss", "team_tov", "team_oreb", "team_fouled", "team_poss"]
opp_stats_columns = ["opp_twos", "opp_threes", "opp_miss", "opp_tov", "opp_oreb", "opp_fouled", "opp_poss"]

# Iterate through each game in the original dataframe
for index, game in filtered_df.iterrows():
    # Extract team name and opponent name for the current game
    team_name = game['team_name']
    opp_name = game['opp_name']

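    # Sum the stats from games before the current one. For the team, use its own rows;
    # for the opponent, use the rows of teams that played against it, which gives the
    # totals that the opponent's defense has allowed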
    team_prev_sum = stats_on_date[stats_on_date['team_name'] == team_name].loc[:index-1][team_stats_columns].sum()
    opp_prev_sum = stats_on_date[stats_on_date['opp_name'] == opp_name].loc[:index-1][team_stats_columns].sum()

    # Extract team stats for the current game
    team_game_stat = game[team_stats_columns].tolist()

    # Combine all data into a single row
    new_row = [game['date'], team_name] + team_prev_sum.tolist() + [opp_name] + opp_prev_sum.tolist() + team_game_stat

    # Append row to the list
    new_rows.append(new_row)

# Create new dataframe with specified columns
new_columns = ['date', 'team_name'] + \
              [f'prev_{stat}' for stat in team_stats_columns] + \
              ['opp_name'] + \
              [f'prev_{stat}' for stat in opp_stats_columns] + \
              team_stats_columns

new_df = pd.DataFrame(new_rows, columns=new_columns)
pd.set_option('display.max_columns', None)

# Now, when you call new_df.head(), it will display all columns
new_df.head()
date team_name prev_team_twos prev_team_threes prev_team_miss prev_team_tov prev_team_oreb prev_team_fouled prev_team_poss opp_name prev_opp_twos prev_opp_threes prev_opp_miss prev_opp_tov prev_opp_oreb prev_opp_fouled prev_opp_poss team_twos team_threes team_miss team_tov team_oreb team_fouled team_poss
0 2024-02-01 Montana State 361.0 163.0 603.0 234.0 164.0 355.0 1803.0 Eastern Washington 345.0 156.0 675.0 253.0 252.0 374.0 1900.0 15 9 41 12 19 20 105
1 2024-02-01 Eastern Washington 370.0 188.0 570.0 265.0 212.0 372.0 1858.0 Montana State 394.0 128.0 594.0 266.0 231.0 392.0 1889.0 20 2 27 18 9 18 92
2 2024-02-01 Montana 439.0 157.0 643.0 215.0 203.0 327.0 1850.0 Idaho 335.0 158.0 609.0 227.0 196.0 345.0 1751.0 16 8 25 11 10 21 87
3 2024-02-01 Idaho 333.0 158.0 647.0 218.0 182.0 313.0 1729.0 Montana 404.0 132.0 678.0 209.0 230.0 351.0 1885.0 19 9 27 10 6 8 74
4 2024-02-01 Northern Colorado 421.0 169.0 634.0 220.0 205.0 336.0 1868.0 Idaho State 417.0 114.0 588.0 255.0 198.0 341.0 1793.0 19 9 30 10 5 22 94
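
The "sum of everything before the current game" idea can also be expressed with a grouped cumulative sum. The sketch below uses a hypothetical game log for one team and a single stat column; it is an alternative formulation, not the loop used above.

```python
import pandas as pd

# Hypothetical game log for a single team, already sorted by date
log = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-02-02", "2024-02-05"]),
    "team_name": ["A", "A", "A", "A"],
    "team_twos": [10, 12, 8, 15],
})

# Running total per team, minus the current game, gives the sum of earlier games only
running = log.groupby("team_name")["team_twos"].cumsum()
log["prev_team_twos"] = running - log["team_twos"]

# Keep only the games on or after the start date, as in the cell above
after_start = log[log["date"] >= "2024-02-01"]
print(after_start[["date", "prev_team_twos", "team_twos"]])
```

For the two games in February, the prior totals are 22 and 30, which are the sums of the earlier games in each case.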

This code takes the aggregate possession data and converts each possession result count into a fraction of the team's total possessions.

poss_percent = new_df.copy()

for index, row in poss_percent.iterrows():
    if row['prev_team_poss'] > 0:
        prev_team_poss = row['prev_team_poss']
        
        
        poss_percent.at[index, 'prev_team_twos'] =   row['prev_team_twos'] / prev_team_poss
        poss_percent.at[index, 'prev_team_threes'] = row['prev_team_threes'] / prev_team_poss
        poss_percent.at[index, 'prev_team_miss'] =   row['prev_team_miss'] / prev_team_poss
        poss_percent.at[index, 'prev_team_tov'] =    row['prev_team_tov'] / prev_team_poss
        poss_percent.at[index, 'prev_team_oreb'] =   row['prev_team_oreb'] / prev_team_poss
        poss_percent.at[index, 'prev_team_fouled'] = row['prev_team_fouled'] / prev_team_poss
        
    if row['prev_opp_poss'] > 0:
        prev_opp_poss = row['prev_opp_poss']
        
        poss_percent.at[index, 'prev_opp_twos'] =   row['prev_opp_twos'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_threes'] = row['prev_opp_threes'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_miss'] =   row['prev_opp_miss'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_tov'] =    row['prev_opp_tov'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_oreb'] =   row['prev_opp_oreb'] / prev_opp_poss
        poss_percent.at[index, 'prev_opp_fouled'] = row['prev_opp_fouled'] / prev_opp_poss
        
        
poss_percent.head(6)
date team_name prev_team_twos prev_team_threes prev_team_miss prev_team_tov prev_team_oreb prev_team_fouled prev_team_poss opp_name prev_opp_twos prev_opp_threes prev_opp_miss prev_opp_tov prev_opp_oreb prev_opp_fouled prev_opp_poss team_twos team_threes team_miss team_tov team_oreb team_fouled team_poss
0 2024-02-01 Montana State 0.200222 0.090405 0.334443 0.129784 0.090960 0.196894 1803.0 Eastern Washington 0.181579 0.082105 0.355263 0.133158 0.132632 0.196842 1900.0 15 9 41 12 19 20 105
1 2024-02-01 Eastern Washington 0.199139 0.101184 0.306781 0.142626 0.114101 0.200215 1858.0 Montana State 0.208576 0.067761 0.314452 0.140815 0.122287 0.207517 1889.0 20 2 27 18 9 18 92
2 2024-02-01 Montana 0.237297 0.084865 0.347568 0.116216 0.109730 0.176757 1850.0 Idaho 0.191319 0.090234 0.347801 0.129640 0.111936 0.197030 1751.0 16 8 25 11 10 21 87
3 2024-02-01 Idaho 0.192597 0.091382 0.374205 0.126084 0.105263 0.181029 1729.0 Montana 0.214324 0.070027 0.359682 0.110875 0.122016 0.186207 1885.0 19 9 27 10 6 8 74
4 2024-02-01 Northern Colorado 0.225375 0.090471 0.339400 0.117773 0.109743 0.179872 1868.0 Idaho State 0.232571 0.063581 0.327942 0.142220 0.110429 0.190184 1793.0 19 9 30 10 5 22 94
5 2024-02-01 Idaho State 0.208425 0.073304 0.344639 0.133479 0.127462 0.193107 1828.0 Northern Colorado 0.199892 0.099406 0.355484 0.131821 0.103728 0.174500 1851.0 17 10 44 11 19 27 115
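
The same conversion can be sketched without an explicit row loop by dividing the count columns by the possession column. The numbers are Montana State's aggregate counts from the new_df output above; only three of the result columns are included.

```python
import pandas as pd

# Aggregate counts for Montana State, copied from the new_df output above
counts = pd.DataFrame({
    "prev_team_twos": [361.0],
    "prev_team_threes": [163.0],
    "prev_team_miss": [603.0],
    "prev_team_poss": [1803.0],
})

share_cols = ["prev_team_twos", "prev_team_threes", "prev_team_miss"]
shares = counts.copy()
# divide every count column by the possession total of the same row
shares[share_cols] = counts[share_cols].div(counts["prev_team_poss"], axis=0)
print(shares.round(6))
```

Note that this sketch does not skip rows with zero prior possessions the way the loop does, so such rows would need to be filtered or handled separately.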

This cell makes the categorical result column. Every game is read through, and each possession result from that game gets its own row, produced by looping over the count for that result.

row_data = []
for index, row in poss_percent.iterrows():
    twos = row['team_twos']
    threes = row['team_threes']
    misses = row['team_miss']
    tov = row['team_tov']
    oreb = row['team_oreb']
    fouls = row['team_fouled']
    poss = row['team_poss']
    
    data = row.values.tolist()
    
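    # keep the date, team names, and prior-percentage columns (the first 17 values);
    # the per-game counts that follow are replaced by the result code
    prev_data = data[0:17]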
    
    for i in range(misses):
        res = [0]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(oreb):
        res = [1]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(twos):
        res = [2]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(threes):
        res = [3]
        new_row = prev_data + res
        row_data.append(new_row)
        
    for i in range(tov):
        res = [4]
        new_row = prev_data + res
        row_data.append(new_row)
    
    for i in range(fouls):
        res = [5]
        new_row = prev_data + res
        row_data.append(new_row)
        
    
        
columns = ['date', 'team_name', 'prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'prev_poss',
                    'opp_name', 'opp_twos',  'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled','opp_poss',
            'result']

game_results = pd.DataFrame(row_data, columns=columns)
game_results.tail(6)
date team_name prev_twos prev_threes prev_miss prev_tov prev_oreb prev_fouled prev_poss opp_name opp_twos opp_threes opp_miss opp_tov opp_oreb opp_fouled opp_poss result
425629 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
425630 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
425631 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
425632 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
425633 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
425634 2024-04-08 Purdue 0.210003 0.087593 0.31307 0.119923 0.149489 0.216358 3619.0 UConn 0.185498 0.067851 0.394292 0.120268 0.128422 0.186372 3434.0 5
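
The expansion from per-game counts into one label per result can be sketched in isolation. The helper name `expand_results` and the dictionary keys are hypothetical; the codes match the ones used in the cell above, and the counts are those of Montana State's 2024-02-01 game from the earlier output.

```python
# Result codes as used in the cell above
RESULT_CODES = {"miss": 0, "oreb": 1, "two": 2, "three": 3, "tov": 4, "fouled": 5}

def expand_results(counts):
    """Turn per-game result counts into a flat list with one code per result."""
    labels = []
    for name, code in RESULT_CODES.items():
        labels.extend([code] * counts[name])  # repeat the code once per occurrence
    return labels


# Counts for Montana State's 2024-02-01 game, taken from the output shown earlier
game_counts = {"two": 15, "three": 9, "miss": 41, "tov": 12, "oreb": 19, "fouled": 20}
labels = expand_results(game_counts)
print(len(labels), labels[:3], labels[-3:])  # 116 [0, 0, 0] [5, 5, 5]
```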

Statistical Analysis

Multinomial Logistic Regression

A multinomial regression is a statistical method used to predict the outcome of a categorical dependent variable with more than two categories. This project uses input data of a team's offensive capabilities (measured by their percentage of possessions that end in missed shots, two-point shots, three-point shots, offensive rebounds, turnovers, and fouls), and another team's defensive capabilities (measured by the same results). By running a multinomial regression on a team's offense and team's defense, probabilities for the team's offensive possession ending in one of those results are calculated.
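
For illustration, the way a multinomial logistic regression turns one linear score per class into probabilities (the softmax function) can be sketched in plain Python. The scores below are hypothetical and are not fitted coefficients from this project.

```python
import math

def softmax(scores):
    """Convert a list of real-valued class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear scores for the six possession results (codes 0 through 5)
scores = [1.2, 0.2, 0.6, -0.3, 0.1, 0.7]
probs = softmax(scores)
print([round(p, 3) for p in probs])
```

The outputs are all positive and sum to one, and the class with the largest score receives the largest probability.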

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)

# fit the model on the whole dataset
test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
X = game_results[test_columns]
y = game_results['result']
model.fit(X, y)

# take one row of inputs to test (the Purdue vs. UConn values from the game_results output above)
row = [0.210003, 0.087593, 0.31307, 0.119923, 0.149489, 0.216358, 0.185498, 0.067851, 0.394292, 0.120268, 0.128422, 0.186372]
row_df = pd.DataFrame([row], columns=test_columns)

# predict a multinomial probability distribution
results = model.predict_proba(row_df)

# summarize the predicted probabilities
print(f"Predicted Probabilities: {results[0]}")
Predicted Probabilities: [0.32276465 0.12013916 0.18664887 0.07507893 0.10356496 0.19180343]
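
The columns returned by predict_proba follow the order of model.classes_, which scikit-learn keeps sorted, so with result codes 0 through 5 each position corresponds to the code of the same number. A small sketch that attaches names to the printed probabilities is shown here; the label strings are chosen for this illustration.

```python
# Names for result codes 0 through 5, in the order used when building game_results
RESULT_NAMES = ["missed shot", "offensive rebound", "two-pointer",
                "three-pointer", "turnover", "fouled"]

# Probabilities copied from the printed output above
predicted = [0.32276465, 0.12013916, 0.18664887, 0.07507893, 0.10356496, 0.19180343]

labelled = dict(zip(RESULT_NAMES, predicted))
for name, p in labelled.items():
    print(f"{name:<18} {p:.3f}")
print("total:", round(sum(predicted), 6))
```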

Simulating Games

The game simulation is done by simulating alternating possessions between two teams. Each team's possession result probabilities are obtained from the multinomial regression model. Probabilities from both teams' offense and defense are used to estimate the likelihood of each team scoring on a given possession. After simulating a predefined number of possessions, the scores are reported.
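
A minimal sketch of the alternating-possession idea follows. It is not the project's actual simulation code. The function name `simulate_game`, the default of 140 total possessions, and the scoring rule are assumptions made for illustration: only made field goals score, free throws are ignored, and an offensive rebound does not give the same team another possession. The probabilities are hypothetical rounded values.

```python
import random

def simulate_game(team1_probs, team2_probs, n_possessions=140, seed=None):
    """Alternate possessions between two teams and return their scores."""
    rng = random.Random(seed)
    codes = [0, 1, 2, 3, 4, 5]   # miss, oreb, two, three, turnover, fouled
    points = {2: 2, 3: 3}        # assumption: only made field goals score here
    scores = [0, 0]
    probs = [team1_probs, team2_probs]
    for i in range(n_possessions):
        side = i % 2             # assumption: possessions strictly alternate
        result = rng.choices(codes, weights=probs[side])[0]
        scores[side] += points.get(result, 0)
    return scores


# Hypothetical possession-result probabilities for two teams
team1 = [0.36, 0.10, 0.19, 0.08, 0.10, 0.17]
team2 = [0.35, 0.10, 0.20, 0.08, 0.10, 0.17]
print(simulate_game(team1, team2, seed=1))
```

Passing a seed makes a single run repeatable, which is convenient when comparing changes to the scoring rules.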

Get team possession probabilities

This function is used to determine each team's possession result probabilities. The first thing that is done is obtaining the most recent aggregate probabilities for each team's offense and defense. These are combined so that one team's offensive and the other team's defensive probabilities are grouped together. These are then passed to the model built on the multinomial regression, and the calculated values are returned.

# get team probabilities
def get_probs(team1, team2):
    
    team1_off = game_results[game_results['team_name'] == team1][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist() 
    team2_off = game_results[game_results['team_name'] == team2][['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled']].iloc[-1].tolist()
    
    team1_def = game_results[game_results['opp_name'] == team1][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
    team2_def = game_results[game_results['opp_name'] == team2][['opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']].iloc[-1].tolist()
        
    most_recent_team1 = team1_off + team2_def
    most_recent_team2 = team2_off + team1_def
    
    test_columns = ['prev_twos', 'prev_threes', 'prev_miss', 'prev_tov', 'prev_oreb', 'prev_fouled', 'opp_twos', 'opp_threes', 'opp_miss', 'opp_tov', 'opp_oreb', 'opp_fouled']
    
    
    team1_probs = model.predict_proba(pd.DataFrame([most_recent_team1], columns=test_columns)).tolist()[0]
    team2_probs = model.predict_proba(pd.DataFrame([most_recent_team2], columns=test_columns)).tolist()[0]
    
    return [team1_probs, team2_probs]

print("Example probabilities between North Carolina and Duke:")
get_probs("North Carolina", "Duke")
Example probabilities between North Carolina and Duke:
[[0.36136018257551633,
  0.10115549297568453,
  0.19492609017437149,
  0.07580382814234912,
  0.09460779242619718,
  0.17214661370588133],
 [0.3535532476244952,
  0.09711739050523685,
  0.2028816180158325,
  0.08113898256442995,
  0.0960341810888952,
  0.1692745802011103]]
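
The returned lists are unlabeled, so it can help to attach a result name to each probability. The sketch below does that for the first list printed above. The label order is an assumption taken from the way sim_poss indexes the list; with scikit-learn, the actual column order of predict_proba follows model.classes_, so that attribute is the place to confirm it. The helper name label_probs is hypothetical.

```python
# Label order assumed from the indexing used in sim_poss; with scikit-learn,
# the real column order of predict_proba is given by model.classes_.
ASSUMED_LABELS = ["miss", "oreb", "two", "three", "tov", "foul"]

def label_probs(probs, labels=ASSUMED_LABELS):
    """Pair each predicted probability with a result label."""
    if len(probs) != len(labels):
        raise ValueError("expected one probability per label")
    return dict(zip(labels, probs))

# North Carolina's probabilities, copied from the example output above
unc_probs = [0.36136018257551633, 0.10115549297568453, 0.19492609017437149,
             0.07580382814234912, 0.09460779242619718, 0.17214661370588133]

labeled = label_probs(unc_probs)
for name, p in labeled.items():
    print(f"{name:>5}: {p:.3f}")
```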

Get team misc stats

Since the dataframe containing the aggregate possession result probabilities does not contain helpful information about free throw percentage or possession count, these numbers are obtained in the get_misc_stats function. The teams' free throw counts are obtained from the team_off_stats dataframe which contains season-level statistics, and the free throw percentage is calculated.

Next the average number of possessions per game for each team is determined. This is done with the stats_on_date dataframe. This dataframe contains game-level stats for every team, and the average possession count is found by simply calling the mean function on the possession column for each team.

def get_misc_stats(team1, team2):
    
    # grab free throw stats for teams
    team1_fts = team_off_stats.loc[team1][['ft', 'ft_miss']].tolist()
    team2_fts = team_off_stats.loc[team2][['ft', 'ft_miss']].tolist()
        
    # calculate free throw percentage
    team1_ft_p = team1_fts[0] / (team1_fts[0] + team1_fts[1])
    team2_ft_p = team2_fts[0] / (team2_fts[0] + team2_fts[1])
    
    # grab average number of possessions in a game
    team1_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team1, 'team_poss'].mean())    
    team2_poss = round(stats_on_date.loc[stats_on_date['team_name'] == team2, 'team_poss'].mean())
    
    team1_stats = [team1_poss, team1_ft_p]
    team2_stats = [team2_poss, team2_ft_p]
    
    return [team1_stats, team2_stats]

print("Example of the misc stats collected for North Carolina and Duke:")
get_misc_stats("North Carolina", "Duke")
Example of the misc stats collected for North Carolina and Duke:
[[94, 0.7589073634204275], [87, 0.7227586206896551]]
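
The free throw percentage is simply makes divided by attempts. A small standalone sketch of that arithmetic follows, with a guard for a team that has no attempts, which get_misc_stats does not include. The function name and the season totals are hypothetical and are not taken from team_off_stats.

```python
def free_throw_pct(made, missed):
    """Return made / attempts, or 0.0 when no free throws were attempted."""
    attempts = made + missed
    if attempts == 0:
        return 0.0
    return made / attempts

# hypothetical season totals: 300 makes and 100 misses
print(free_throw_pct(300, 100))
```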

Simulate a possession

The sim_poss function simulates one offensive possession during a game simulation and returns the points scored on it. The team's possession result probabilities are passed in as an argument, and a result is drawn with the random.choices function using those probabilities as weights. A missed shot followed by an offensive rebound keeps the possession alive, so a new result is drawn. Otherwise the function returns 0, 2, or 3 points, or, after a foul, the number of free throws made out of two.
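
The weighted draw described above can be tried in isolation. This is a minimal sketch with hypothetical weights and a seeded generator so that the run is repeatable; it compares the share of each result over many draws with the share implied by the weights.

```python
import random
from collections import Counter

options = ["fg_miss", "two_pointer", "three_pointer", "turnover", "foul"]
# hypothetical weights; random.choices normalizes them, so they need not sum to 1
weights = [0.36, 0.19, 0.08, 0.09, 0.17]

rng = random.Random(0)  # seeded generator so the run is repeatable
draws = rng.choices(options, weights=weights, k=10_000)
counts = Counter(draws)

total_weight = sum(weights)
for option, weight in zip(options, weights):
    expected = weight / total_weight
    observed = counts[option] / len(draws)
    print(f"{option:>13}: expected {expected:.3f}, observed {observed:.3f}")
```

Because the weights are normalized, leaving the offensive rebound probability out of the list (as sim_poss does) rescales the remaining five results rather than causing an error.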

import random
# add normally distributed noise to a value; the second argument is the standard deviation
def adj(mean, sigma):
    return random.normalvariate(mean, sigma)

def sim_poss( probs ):
    while ( True ):
        # list of every option for a given possession
        options = ['fg_miss', 'two_pointer', 'three_pointer', 'turnover', 'foul']
        # list of probability for every possession option
        probabilities = [adj(probs[0], .0),      # miss
                         adj(probs[2], .00),     # two
                         adj(probs[3], .0),      # three
                         adj(probs[4], .00),     # tov
                         adj(probs[5], .00) ]    # foul
        

        # randomly choose possession option
        result = random.choices(options, weights=probabilities, k=1)[0]

        # return how each possession option will affect the score
        if result == 'fg_miss':
            # if offensive rebound, keep current possession
            x = random.random()
            if (x < adj(probs[1], .008 ) ) : 
                pass
            else:
                return 0
            
        elif result == 'two_pointer':
            return 2
        
        elif result == 'three_pointer':
            return 3
        
        elif result == 'foul':
            ft_made = 0
            # simulate two free throw shots
            for i in range(2):
                x = random.random()
                if (x < adj(probs[6], .015) ):
                    ft_made += 1
            
            return ft_made
            
        else:
            return 0
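
As a rough cross-check on sim_poss, the expected points per possession can be worked out directly from the probabilities. The sketch below is an assumption-laden approximation rather than part of the simulation: it ignores the random noise added by adj, and the function name expected_points is hypothetical. If E is the expected value, then E equals the immediate expected points plus the chance of a miss followed by an offensive rebound times E, which gives the closed form used in the code.

```python
def expected_points(probs):
    """Expected points from one sim_poss call, ignoring the noise from adj.

    probs is ordered the way sim_poss indexes it:
    [miss, oreb, two, three, tov, foul, free_throw_pct].
    """
    miss, oreb, two, three, tov, foul, ft_pct = probs
    # random.choices normalizes the five weights, so divide by their sum
    total = miss + two + three + tov + foul
    points_now = (2 * two + 3 * three + 2 * ft_pct * foul) / total
    # a miss followed by an offensive rebound restarts the possession
    p_restart = (miss / total) * oreb
    return points_now / (1 - p_restart)

# North Carolina's probabilities and free throw percentage from the example outputs above
unc = [0.36136018257551633, 0.10115549297568453, 0.19492609017437149,
       0.07580382814234912, 0.09460779242619718, 0.17214661370588133,
       0.7589073634204275]
print(f"{expected_points(unc):.3f} expected points per possession")
```

Multiplying this figure by a team's share of the simulated possessions gives a ballpark score that the simulated averages should land near.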

Average function

A simple helper function that calculates the average score across many simulated games.

# used to compute the average score of lots of simulated games
def Average(x): 
    return sum(x) / len(x) 
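
The standard library's statistics module offers the same average along with a measure of spread, which can be useful when judging how much simulated scores vary. The sketch below is an optional alternative, not the helper used by sim_games. It uses the scores from the ten-game NC State and North Carolina run shown later in this section.

```python
import statistics

# scores copied from the ten-game NC State vs. North Carolina output in this section
nc_state = [98, 97, 82, 95, 101, 81, 82, 75, 89, 100]
north_carolina = [104, 102, 109, 105, 90, 110, 89, 87, 75, 102]

for name, scores in [("NC State", nc_state), ("North Carolina", north_carolina)]:
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)  # sample standard deviation
    print(f"{name}: mean {mean:.2f}, standard deviation {spread:.2f}")
```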

Core game simulation loop

The sim_games function is the culmination of this project. It repeatedly simulates games between two user-given teams.

It takes the two teams that should play each other as arguments, defaulting to North Carolina and Duke. It uses the get_probs and get_misc_stats functions to obtain the teams' necessary statistics, and combines these into one array.

Next a number of games (given by the optional argument to the function) are simulated. A game is simulated by calling the sim_poss function until the sum of each team's average possession count has been reached. The scores of the game are noted, as well as a win count for each team. After the requested number of games has been reached (or ten games if no argument was given), the scores of each game are reported, as well as each team's average score and win count.

def sim_games(num_games = 10, team1 = "North Carolina", team2 = "Duke"):
    
#     team1 = input("Please enter team1:\n")
#     team2 = input("Please enter team2:\n")

#     if (team1 == ''):
#         team1 = 'North Carolina'
#     if (team2 == ''):
#         team2 = 'Duke'

    # grab team stats from helper functions
    probs = get_probs(team1, team2)
    misc  = get_misc_stats(team1, team2)

    # these contain the possession result probabilities
    team1_probs = probs[0]
    team2_probs = probs[1]

    # these contain average possession count and free throw percentage
    team1_misc = misc[0]
    team2_misc = misc[1]

    # add the two teams' average possession counts to get the total possessions in a simulated game
    max_poss = team1_misc[0] + team2_misc[0]

    # make one array with all of each team's necessary stats
    both_team_probs = [team1_probs + [team1_misc[1]], team2_probs + [team2_misc[1]] ]


    # used to keep track of the scores across multiple sims
    team1_scores = []
    team2_scores = []

    team1_wins = 0
    team2_wins = 0


    
    for i in range(num_games):
        scores = [0,0]
        curr_poss = 0
        while curr_poss < max_poss:
            team = curr_poss%2
            scores[team] += sim_poss(both_team_probs[team] )
            curr_poss+=1
            
        team1_scores.append(scores[0])
        team2_scores.append(scores[1])
        if (scores[0] > scores[1]):
            team1_wins += 1
        else:
            # overtime is not simulated, so a tie is also counted as a team2 win
            team2_wins += 1
        print(f"Game {i+1:2.0f}    {team1:>20} {scores[0]:3.0f} \t {scores[1]:3.0f}   {team2:<20} ")

    print(f"\nAverage {team1:^30} score: {Average(team1_scores):.2f}" + 
          f"\nAverage {team2:^30} score: {Average(team2_scores):.2f}")
    
    print(f"\n{team1:<30} wins: {team1_wins:2.0f}" + 
          f"\n{team2:<30} wins: {team2_wins:2.0f}")
sim_games(1, "Houston", "Arizona")
Game  1                 Houston  79 	  96   Arizona              

Average            Houston             score: 79.00
Average            Arizona             score: 96.00

Houston                        wins:  0
Arizona                        wins:  1
sim_games(team1="NC State", team2="North Carolina")
Game  1                NC State  98 	 104   North Carolina       
Game  2                NC State  97 	 102   North Carolina       
Game  3                NC State  82 	 109   North Carolina       
Game  4                NC State  95 	 105   North Carolina       
Game  5                NC State 101 	  90   North Carolina       
Game  6                NC State  81 	 110   North Carolina       
Game  7                NC State  82 	  89   North Carolina       
Game  8                NC State  75 	  87   North Carolina       
Game  9                NC State  89 	  75   North Carolina       
Game 10                NC State 100 	 102   North Carolina       

Average            NC State            score: 90.00
Average         North Carolina         score: 97.30

NC State                       wins:  2
North Carolina                 wins:  8
sim_games(20, "UConn", "Purdue")
Game  1                   UConn  86 	  89   Purdue               
Game  2                   UConn  95 	  97   Purdue               
Game  3                   UConn 111 	 101   Purdue               
Game  4                   UConn 107 	 100   Purdue               
Game  5                   UConn 115 	  88   Purdue               
Game  6                   UConn 107 	  84   Purdue               
Game  7                   UConn  92 	  90   Purdue               
Game  8                   UConn 100 	  80   Purdue               
Game  9                   UConn  94 	 102   Purdue               
Game 10                   UConn 109 	  88   Purdue               
Game 11                   UConn 106 	  96   Purdue               
Game 12                   UConn  89 	  98   Purdue               
Game 13                   UConn  93 	  89   Purdue               
Game 14                   UConn 107 	 110   Purdue               
Game 15                   UConn  90 	 105   Purdue               
Game 16                   UConn  96 	 106   Purdue               
Game 17                   UConn  91 	  79   Purdue               
Game 18                   UConn  92 	 104   Purdue               
Game 19                   UConn 101 	  98   Purdue               
Game 20                   UConn 109 	  88   Purdue               

Average             UConn              score: 99.50
Average             Purdue             score: 94.60

UConn                          wins: 12
Purdue                         wins:  8
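
Because sim_games has no overtime, its else branch credits a tied game to the second team. The sketch below shows one way to count ties separately; the function name tally is hypothetical and this is not how sim_games currently works. It is applied to the twenty UConn and Purdue scores listed above, which contain no ties.

```python
def tally(team1_scores, team2_scores):
    """Count team1 wins, team2 wins, and ties across paired game scores."""
    team1_wins = team2_wins = ties = 0
    for s1, s2 in zip(team1_scores, team2_scores):
        if s1 > s2:
            team1_wins += 1
        elif s2 > s1:
            team2_wins += 1
        else:
            ties += 1
    return team1_wins, team2_wins, ties

# scores copied from the twenty-game UConn vs. Purdue output above
uconn = [86, 95, 111, 107, 115, 107, 92, 100, 94, 109,
         106, 89, 93, 107, 90, 96, 91, 92, 101, 109]
purdue = [89, 97, 101, 100, 88, 84, 90, 80, 102, 88,
          96, 98, 89, 110, 105, 106, 79, 104, 98, 88]

print(tally(uconn, purdue))  # → (12, 8, 0)
```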